vignettes/2_usage.Rmd
2_usage.RmdThe simplest way to use scigenex is to run the following steps using the Seurat R package:
The subsequent object can be used as input to SciGeneX. Alternatively, you can provide a normalized matrix as input.
For this tutorial, we will use scRNA-seq dataset of Peripheral Blood
Mononuclear Cells (PBMC) freely available on 10x Genomics web site. This
dataset contains 2700 single cells sequenced on the Illumina NextSeq
500. This dataset can be downloaded from
10X Genomics or through the
SeuratDatalibrary
In this step, we will run the classical pre-processing steps from the. Please refer to this tutorial for more informations. If you have already pre-processed your data with Seurat or have a normalized count matrix as input, you can skip this step.
library(ggplot2)
library(patchwork)
library(Seurat, quietly = TRUE)
library(SeuratData)
InstallData("pbmc3k")
data("pbmc3k")We next runthe classical steps of the seurat pipeline. For more information, you can check Seurat website here.
# Quality control
pbmc3k[["percent.mt"]] <- PercentageFeatureSet(pbmc3k, pattern = "^MT-")
pbmc3k <- subset(pbmc3k, subset = percent.mt < 5 & nFeature_RNA > 200)
# Normalizing
pbmc3k <- NormalizeData(pbmc3k)
# Identification of highly variable genes
pbmc3k <- FindVariableFeatures(pbmc3k, selection.method = "vst", nfeatures = 2000)
# Scaling data
pbmc3k <- ScaleData(pbmc3k, features = rownames(pbmc3k), verbose = FALSE)
# Perform principal component analysis
pbmc3k <- RunPCA(pbmc3k, features = VariableFeatures(object = pbmc3k), verbose = FALSE)
# Cell clustering
pbmc3k <- FindNeighbors(pbmc3k, dims = 1:10, verbose = FALSE)
pbmc3k <- FindClusters(pbmc3k, resolution = 0.5, verbose = FALSE)
# Dimension reduction
pbmc3k <-suppressWarnings(RunUMAP(pbmc3k, dims = 1:10, verbose = FALSE))
DimPlot(pbmc3k, reduction = "umap")
In this section, we use the previously generated Seurat object as an input to run SciGeneX main commands. This command run the main algorithm that will:
First we will load the Scigenex library. To limit the verbosity of Scigenex function we will set the verbosity level to zero (which will switch off information messages and debugging messages).
library(scigenex)
scigenex::set_verbosity(0)Then we call successively the select_genes() function
which will select informative genes (i.e co-regulated), then
the gene_clustering() function which will call MCL and
partition the dataset into gene modules.
# Select informative genes
pbmc_scigenex <- select_genes(pbmc3k,
k = 80,
distance_method = "pearson",
which_slot = "data",
row_sum=2)
# Run MCL
pbmc_scigenex <- gene_clustering(pbmc_scigenex,
k = 5,
threads = 2,
inflation = 2)The object produced is a ClusterSet objet that is a S4
object that is intented to store gene modules.
isS4(pbmc_scigenex)## [1] TRUE
pbmc_scigenex## An object of class ClusterSet
## Name: s8gZ5uYQ33
## Memory used: 97239768
## Number of cells: 2643
## Number of informative genes: 2266
## Number of gene clusters: 232
## This object contains the following informations:
## - data
## - gene_clusters
## - top_genes
## - gene_clusters_metadata
## - gene_cluster_annotations
## - cells_metadata
## - dbf_output
## - parameters
## * distance_method = pearson
## * k = 80
## * noise_level = 5e-05
## * fdr = 0.005
## * row_sum = 2
## * no_dknn_filter = FALSE
## * seed = 123
## * keep_nn = FALSE
## * k_mcl_graph = 5
## * output_path = /var/folders/zy/wl3dj2_n76zfc8sdvny1q06c0000gn/T//RtmpNqEch3
## * name = s8gZ5uYQ33
## * inflation = 2
There are various methods associated with the ClusterSet
objects.
## [1] "[" "%in%"
## [3] "clust_names" "clust_size"
## [5] "cluster_set_to_xls" "cluster_stats"
## [7] "col_names" "dim"
## [9] "enrich_go" "gene_cluster"
## [11] "grep_clust" "nclust"
## [13] "plot_clust_enrichments" "plot_ggheatmap"
## [15] "plot_markers_to_clusters" "rename_clust"
## [17] "reorder_clust" "reorder_genes"
## [19] "row_names" "show"
## [21] "top_by_go" "top_genes"
## [23] "viz_enrich" "which_clust"
The current object contains 2266 informative genes, 2643 samples and 232 gene modules.
nrow(pbmc_scigenex)## [1] 2266
ncol(pbmc_scigenex)## [1] 2643
nclust(pbmc_scigenex)## [1] 232
At this step, several modules need to be filtered out as lots of them
may be singletons. Interestingly, the ClusterSet class
implements the indexing operator/function (“[”). The first
argument/dimension of the indexing function corresponds to the cluster
stored in the object. The second dimension correspond to cell names. As
an example, one can simply keep gene modules with size (i.e
number of genes) greater than 7 using the following code. This leads to
an object containing 48 gene modules.
pbmc_scigenex <- pbmc_scigenex[clust_size(pbmc_scigenex) > 7, ]
nclust(pbmc_scigenex)## [1] 48
It may also be important to filter out gene based on dispersion.
Several parameters can be computed for each cluster using the
cluster_stats()function.
plot_cluster_stats(cluster_stats(pbmc_scigenex)) +
ggplot2::theme(axis.text.y = ggplot2::element_blank(),
axis.text.x = element_text(angle=45, vjust = 0.5),
panel.grid = element_blank()) 
Here we will select gene modules based on standard deviation (> 0.1) and rename the cluster (from 1 to the number of clusters):
pbmc_scigenex <- pbmc_scigenex[cluster_stats(pbmc_scigenex)$sd_total > 0.1, ]
pbmc_scigenex <- rename_clust(pbmc_scigenex)
nclust(pbmc_scigenex)## [1] 17
The we check the statistics again.
plot_cluster_stats(cluster_stats(pbmc_scigenex)) +
ggplot2::theme(axis.text.y = ggplot2::element_blank(),
axis.text.x = element_text(angle=45, vjust = 0.5),
panel.grid = element_blank()) 
Clusters of genes are stored in the gene_clusters slot.
One can access the gene names from a cluster using the
get_genes() command. By default, all genes are
returned.
# Extract gene names from the 5th gene cluster
genes_module_5 <- get_genes(pbmc_scigenex, cluster = 5)
head(genes_module_5)## [1] "CD79A" "MS4A1" "CD79B" "HLA-DQA1" "TCL1A" "LINC00926"
One can also access the gene to cluster mapping using
get_genes().
head(gene_cluster(pbmc_scigenex))## CST3 TYROBP AIF1 LST1 LYZ FTL
## 1 1 1 1 1 1
tail(gene_cluster(pbmc_scigenex))## SDCBP2 RP4-781K5.2 C1orf198 SSFA2 ZGLP1 WBP5
## 17 17 17 17 17 17
The visualization of a heatmap containing all cells and modules may
be time consuming and may require important memory ressources. A first
alternative is to look at gene cluster individually or a subset of
clusters. The gene modules can be visualized using the
plot_heatmap() or the plot_ggheatmap()
functions. The first is mainly intented to propose interactive
vizualisation based on the iheatmapr library and allows to
easily browse the results and zoom on particular regions of the heatmap
The second should be mostly used for non interactive vizualisation. This
second solution leads to a plot that is more easilly customable as it is
bas ed on the ggplot framework. Here we use
plot_ggheatmap() to have a look at the 4 first clusters.
Note that here, we choose to order columns/cells on
Seurat::FindClusters results.
plot_ggheatmap(pbmc_scigenex[1:4, ],
use_top_genes = FALSE,
ident=Idents(pbmc3k)) + ggtitle("Cluster 1") 
However one alternative is to extract the most representative genes
of each cluster. This can be achieved using the top_genes()
function. This function will store the identifiers of these
representative genes inside the top_genes slot of the
ClusterSet object. The top_genesslot can be
accessed the get_genes() function.
pbmc_scigenex <- top_genes(pbmc_scigenex)
genes_cl5_top <- get_genes(pbmc_scigenex, cluster = 5, top = TRUE)
genes_cl5_top## [1] "CD79A" "MS4A1" "HLA-DQA1" "CD79B" "HLA-DQB1" "TCL1A"
## [7] "LINC00926" "VPREB3" "HLA-DQA2" "FCER2" "BANK1" "TSPAN13"
## [13] "HLA-DOB" "FCRLA" "HVCN1" "CD37" "PKIG" "GNG7"
## [19] "EAF2" "SPIB"
Both plot_heatmap() and plot_ggheatmap()
support the use_top_genesargument:
plot_ggheatmap(pbmc_scigenex,
use_top_genes = TRUE,
ident=Seurat::Idents(pbmc3k)) + ggtitle("All clusters (top genes)") +
theme(strip.text.y = element_text(size=4))
A very interesting feature of Scigenex is its ability to display gene
expression levels across cells through interactive heatmaps. Using this
feature, user can interactively assess the expression levels in selected
cells or cluster or on the whole dataset. However, it is generally
advisable, when using all the clusters to restrict analysis using
top_genes=TRUE. Here we will display the expression levels
across cells of cluster 3 to 5.
plot_heatmap(pbmc_scigenex[3:5, ],
use_top_genes = TRUE,
cell_clusters =Seurat::Idents(pbmc3k))